Group number: Group 13
Group Members: Shehroz Sohail, Amitabh Singh Virk, Ajna A Rivera

hollywood.png

Context:

In the last few years, the growth of the internet has accelerated globalisation in every market, from e-commerce to entertainment. Platforms like YouTube and Facebook have made it easier than ever for creators to reach new audiences. This trend has also affected Hollywood. With new technologies and advances in computing power, Hollywood is investing more than ever in data analytics, and more and more entertainment companies are hiring data professionals to support business decisions with data insights.

A well-known distribution house, Big Time Hollywood Production Company, is interested in distributing movies. Because Hollywood movies have grown more popular across the globe in recent years, the company plans to expand outside the US. It has been quite successful in the past and wants to keep its winning streak alive as the business expands.

The company has hired a team of data scientists to help it make informed decisions about which movies to distribute. It wants insights into the factors that affected the success of previous movies.

Objective:

As data scientists our objective will be to gather the relevant data and create a presentation for the stakeholders with our findings that can benefit the expansion project.

We will follow the process defined in the steps below:

Step 1: Data Gathering
The first step in any successful analysis is gathering the data. Our primary dataset is an IMDb dataset on Kaggle: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+title_principals.csv
We also use data directly from IMDb and from a financial website to supplement this (see below for references).

Step 2: Data Cleaning
In this step we bring the data into a form suitable for our analysis. This is one of the most crucial steps and acts as the building block for our report.

Step 3: EDA (Exploratory Data Analysis)
In this step we uncover the underlying structure of the data. We dissect the data, view it in different forms, and build a story with the statistical tools at our disposal.

Step 4: Findings and Insights
In this part we will find the patterns, look for the insights and make recommendations based on all the evidence that we have collected so far.

Questions we are trying to answer:

  1. How does this data relate to the business and how does it justify our objective?
  2. What factors affect the revenues made by previous films?
  3. Does the budget of the movie affect the success of the movie?
  4. What factors lead to the not so good ratings and good ratings?
  5. If a movie does well in one market is it likely to do well in others?

Assumptions:

  1. Our primary dataset was collected from Kaggle, and IMDb is used to verify certain aspects of the data.
  2. This analysis is intended to be used by the business to make decisions.
  3. All the recommendations and analysis are evidence based.

Stakeholder

Big Time Hollywood Production Company

Importing Libraries and Data Set

# Download the files into the session
download.file('https://drive.google.com/uc?export=download&id=14ufjboo68cDcV1wXGYN9GrGlw9iMmYwo', 'IMDbratings')
ratings <- read.csv('/content/IMDbratings')
download.file('https://drive.google.com/uc?export=download&id=1GIqfLMHlI-q6Y7JNtENpf8KAsGCggw7t', 'IMDbmovies')
movies <- read.csv('/content/IMDbmovies')
download.file('https://drive.google.com/uc?export=download&id=1TSf0qGS2MJxjwYciX9lE4SEWFrHOLF-K', 'tick.inflation')
tick.inflation <- read.csv('/content/tick.inflation')
download.file('https://drive.google.com/uc?export=download&id=1i02NwyXQ_syqqAEo4WmU_ZXsnS8B3jF1', 'rates')
# Treat empty strings as missing values in the exchange-rate table
rates <- read.csv('/content/rates', na.strings=c("","NA"))

Cleaning Data

Adaptation

Data Exploration and Descriptive Visualizations

In our first set of Descriptive Visualizations, we focus solely on raw data exploration in our largely unfiltered dataset. Additional Descriptive Visualizations are in our Data Analysis section.

We look at:

Number of Movies per country

This section explores the global distribution of the movie production industry. We find that the US is the primary driver of this industry.

This creates a new dataframe with number of movies per country
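
As an illustration of the per-country count described above, here is a minimal base-R sketch. It assumes a `country` column that may hold several comma-separated countries per film; the toy data frame `movies_demo` and the output column names are hypothetical stand-ins, not the notebook's actual objects.

```r
# Toy stand-in for the movies table (hypothetical values and column names)
movies_demo <- data.frame(
  title   = c("A", "B", "C", "D"),
  country = c("USA", "USA, UK", "France", "USA"),
  stringsAsFactors = FALSE
)

# Split multi-country entries so a co-production counts once per country
countries <- trimws(unlist(strsplit(movies_demo$country, ",")))

# Tabulate and sort so the most represented country comes first
counts <- sort(table(countries), decreasing = TRUE)

# Store the result as a new data frame with one row per country
movies_per_country <- data.frame(
  country  = names(counts),
  n_movies = as.integer(counts),
  stringsAsFactors = FALSE
)
print(movies_per_country)
```

Counting a co-production once per listed country is a design choice; counting only the first listed country would give different totals.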

Number of Movies per year

We find that the number of movies in the IMDb dataset has gone down over the last few years.

Number of Movies in Genre

Movies in more niche genres will have fewer datapoints associated with them. For any genre-level analyses we will want to only focus on the most represented genres.

Average rating per Genre

Number of Movies made by a director

Since this data has been examined in a previous analysis (see Report section for details), we will not analyze it further. It is interesting to note that some directors are much more prolific than others.

Response Variable

(See Analysis section below for code that generated the ROI variable)

The response variables we focused on in this study were return on investment (ROI, called netratio in our dataset) and average vote.

ROI is a unitless variable we calculated ourselves that expresses income normalized by budget. Our original dataset had two types of income data for most films: gross income in the US and gross income worldwide (all countries except the US). This data was not corrected for inflation. The dataset also had budget data for some films, expressed in multiple currencies. To convert everything into the most prevalent currency (USD), we used a table of exchange rates for each currency for each year in our dataset (e.g., the Indian Rupee to USD conversion rate in 1995). After conversion, we had a budgetinUSD column that showed budget in a single currency, not corrected for inflation. We assumed that the budget and income for each movie were recorded in similar time periods (same year or adjacent years). Therefore we calculated ROI as domestic plus worldwide income divided by budget. This gave us a unitless number that did not need to be corrected for inflation.

Average vote (avg_vote) is a metric that came with our dataset and combines all user ratings into a single "popularity" number for each film. We use this as a popularity measure, assuming that films with higher popularity have higher viewership and desirability and are thus more valuable to streaming services.
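
The ROI calculation described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy tables, the exchange-rate values, and the column names other than `budgetinUSD` and `netratio` are assumptions made for the example.

```r
# Toy film table (hypothetical values; incomes are already in USD)
films <- data.frame(
  title                  = c("X", "Y", "Z"),
  year                   = c(1995, 1995, 2010),
  currency               = c("USD", "INR", "GBP"),
  budget                 = c(1e6, 3.2e7, 5e5),
  usa_gross_income       = c(2e6, 5e5, 1e6),
  worldwide_gross_income = c(1e6, 1.5e6, 5e5),
  stringsAsFactors = FALSE
)

# Toy exchange-rate table: USD per unit of currency, by year (made-up rates)
rates_demo <- data.frame(
  year         = c(1995, 1995, 2010),
  currency     = c("USD", "INR", "GBP"),
  usd_per_unit = c(1, 1 / 32, 1.5),
  stringsAsFactors = FALSE
)

# Attach the rate for each film's currency and year, then convert the budget
films <- merge(films, rates_demo, by = c("year", "currency"), all.x = TRUE)
films$budgetinUSD <- films$budget * films$usd_per_unit

# ROI: (domestic + worldwide income) / budget, all in USD, so it is unitless
films$netratio <- (films$usa_gross_income + films$worldwide_gross_income) /
  films$budgetinUSD
print(films[, c("title", "budgetinUSD", "netratio")])
```

Using `all.x = TRUE` keeps films whose currency-year pair has no rate; their `budgetinUSD` and `netratio` become `NA` and can be dropped later.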

Analysis

Explore Correlations and additional Descriptive Visualization

This section includes our first two linear regressions.
We first found that our dataset was not corrected for inflation (using a linear regression and a plot with an abline visualization). We already knew that it contained budget numbers in multiple currencies. To address this, we:

  1. Created a new column normalizing income to annual ticket price data from IMDb
  2. Used an annual exchange-rate table to convert all numbers into US Dollars (by year, since exchange rates can vary).
  3. Created the ROI column to express income in terms of budget for each film.

We then checked that ROI was not sensitive to year, using a linear regression as well as a plot with an abline.
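
A minimal sketch of this kind of year-sensitivity check is shown below. The data are synthetic and deterministic (nominal income drifting upward, ROI wobbling around a constant), so the numbers are illustrative assumptions rather than results from the IMDb data.

```r
# Synthetic stand-in data (not the real dataset)
year   <- 1980:2019
income <- 1e6 * 1.03^(year - 1980)  # nominal income growing about 3% a year
roi    <- 2.5 + 0.1 * sin(year)     # ROI with no built-in time trend
demo   <- data.frame(year, income, roi)

# Regress each measure on year; a clear positive slope suggests inflation
income_fit <- lm(income ~ year, data = demo)
roi_fit    <- lm(roi ~ year, data = demo)

# Scatter plot with the fitted line drawn by abline()
pdf(NULL)  # null device so the sketch runs headless; drop this in a notebook
plot(demo$year, demo$roi, xlab = "year", ylab = "ROI")
abline(roi_fit, col = "red")
invisible(dev.off())

income_slope <- unname(coef(income_fit)["year"])
roi_slope    <- unname(coef(roi_fit)["year"])
print(c(income_slope = income_slope, roi_slope = roi_slope))
```

In practice one would also look at `summary(roi_fit)` to read the p-value of the `year` coefficient rather than the slope alone.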

This creates a column for income (already in USD) normalized for annual ticket price (from IMDb). We removed outliers by keeping only movies with income under the gross income record of $3.7 billion, adjusted for inflation (Reference).
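
One plausible way to build such a column is sketched below. The exact normalization used in the notebook is not shown in this report, so the formula here (income divided by that year's ticket price, rescaled to the latest year's price) is an assumption, as are the toy ticket prices and all column names except `ticketinf`.

```r
# Toy ticket-price table (made-up prices)
tick_demo <- data.frame(
  year         = c(1980, 2000, 2019),
  ticket_price = c(2.5, 5, 10)
)

# Toy income table in USD; "S" is a deliberately implausible outlier
films <- data.frame(
  title  = c("P", "Q", "R", "S"),
  year   = c(1980, 2000, 2019, 1980),
  income = c(1e8, 2e8, 3e8, 1e9),
  stringsAsFactors = FALSE
)

# Attach each film's ticket price and express income in latest-year prices
films <- merge(films, tick_demo, by = "year", all.x = TRUE)
ref_price <- tick_demo$ticket_price[tick_demo$year == max(tick_demo$year)]
films$ticketinf <- films$income / films$ticket_price * ref_price

# Drop films above the inflation-adjusted gross record of $3.7 billion
films <- films[films$ticketinf < 3.7e9, ]
print(films)
```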

Created the unitless ROI measure

Correlation exploration:

  1. Make set of only numerical columns
  2. Heat map of correlations between all numerical columns
  3. Heat map of selected set of numerical columns
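
The three steps above can be sketched in base R as follows. The toy data frame and its values are hypothetical, and base `heatmap()` is used here only as one possible plotting choice; the notebook may have used a different plotting package.

```r
# Toy table mixing text and numeric columns (hypothetical values)
demo <- data.frame(
  title    = c("A", "B", "C", "D", "E", "F"),
  duration = c(90, 105, 120, 95, 130, 110),
  avg_vote = c(5.5, 6.1, 7.4, 5.9, 7.8, 6.5),
  netratio = c(1.2, 3.4, 0.8, 2.9, 1.5, 2.2),
  stringsAsFactors = FALSE
)

# Step 1: keep only the numerical columns
numeric_cols <- demo[sapply(demo, is.numeric)]

# Step 2: pairwise correlations, ignoring missing values pair by pair
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")

# Step 3: heat map of the correlation matrix (clustered rows and columns)
pdf(NULL)  # null device so the sketch runs headless; drop this in a notebook
heatmap(cor_matrix, symm = TRUE, scale = "none")
invisible(dev.off())

print(round(cor_matrix, 2))
```

For a selected subset, index the matrix by name, for example `cor_matrix[c("avg_vote", "netratio"), c("avg_vote", "netratio")]`.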

Selecting columns for further analysis

  1. Use weighted ratings and delete raw ratings
  2. Delete columns of numbers of ratings
  3. Delete columns not corrected for inflation

Further analysis of the columns in the heatmap above
Pairwise correlations by clusters in the heatmap

Income cluster:

ROI is not normally distributed, but the other measures are all approximately normal.

Multiple Linear Regressions

1st MLR Model

MLR model #1: examines the effect of year, duration, average rating, raw budget (not inflation or currency adjusted), US income, worldwide income, median vote, weighted ratings for 5 demographic categories, income, income adjusted for ticket price, and adjusted budget (in USD, not adjusted for inflation). This model is not predictive, with an $R^2$ close to 0 (0.01). It does find a significant negative correlation between ROI and budget, and a significant positive correlation between ROI and two income measures: USA gross income (not inflation corrected) and income corrected for ticket prices (an inflation marker).
There is also a significant negative correlation between movie duration and ROI.

2nd MLR Model

MLR model #2: looks at the same metrics as above but filters for movies with ROI less than 20 (films that make less than 20 times their initial budget). This captures over 97% of films with budget and income data and leaves out a small number of films, mostly with incredibly small budgets but high reported incomes (see plot below). Without these outliers, we get an $R^2$ of 45%.
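
To illustrate how a handful of extreme ROI values can swamp a regression, here is a minimal sketch on synthetic data. The predictors, coefficients, and outlier values are invented for the example; only the ROI < 20 cutoff comes from the analysis above.

```r
# Synthetic "typical" films: ROI depends linearly on budget and duration
i <- 1:50
typical <- data.frame(
  budget_musd = i,
  duration    = 90 + (i %% 7) * 5
)
typical$netratio <- 1 + 0.1 * typical$budget_musd +
  0.01 * typical$duration + 0.3 * sin(i)

# Two tiny-budget films with very large reported ROI (made-up outliers)
outliers <- data.frame(
  budget_musd = c(0.01, 0.02),
  duration    = c(100, 110),
  netratio    = c(500, 800)
)
films <- rbind(typical, outliers)

# Fit the same model with and without the ROI < 20 filter
fit_all  <- lm(netratio ~ budget_musd + duration, data = films)
kept     <- films[films$netratio < 20, ]
fit_kept <- lm(netratio ~ budget_musd + duration, data = kept)

r2_all     <- summary(fit_all)$r.squared
r2_kept    <- summary(fit_kept)$r.squared
share_kept <- nrow(kept) / nrow(films)
print(c(r2_all = r2_all, r2_kept = r2_kept, share_kept = share_kept))
```

Reporting `share_kept` alongside the two fits makes it clear how much data the filter discards.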

3rd MLR model

Realistically, we don't know how much a movie will make before we distribute it.
Here we make a model based only on data we can get prior to release: voting data from focus groups/screenings, budget data, and the duration of the movie.

Now we get a low $R^2$ value of only about 12%. This indicates that while our priors do influence ROI, they are not good predictive measures.

This is a MLR with ROI as the response variable.

Prediction of movie popularity: streaming services metric

We cannot reliably predict ROI based on a historical dataset. However, the ROI of films nowadays is much more complex than that of films in the past. In particular, whether a film is picked up by a streaming service can make or break it.
In this analysis, we assume that movies that are the most popular in focus groups/screenings will be more likely to be picked up by streaming services.

In a multiple linear regression, we find that several vote measures are significantly explanatory for average vote ($R^2$ = 98%). Budget in USD, duration, and year do not explain average vote (p>0.01).
By performing single linear regressions, we find that individual demographic groups perform as well or nearly as well as the multiple linear regression ($R^2$ of 89-98%). See below for group-by-group analysis.


Big question: whose vote matters? We tackled this question overall (any demographic group is reliable) as well as for specific genres (different demographic groups perform better for some genres).

Single Linear Regression Analyses (3 to 8)

Here we look to see which demographic groups on their own are best at predicting the average vote for a film

Weighted votes for all demographic groups were significantly explanatory for the average vote. This tells us that we do not need to seek specific demographic groups to find movies that will be loved! This greatly simplifies pre-screen data collection.
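
The group-by-group comparison can be sketched as a loop of single linear regressions, one per demographic column, collecting $R^2$ for each. The toy ratings and the column names `males`, `females`, and `age_18_30` are hypothetical placeholders for the weighted-vote columns.

```r
# Toy ratings table (hypothetical values)
avg_vote <- c(5.1, 6.3, 7.2, 4.8, 8.0, 6.9, 5.5, 7.7)
ratings_demo <- data.frame(
  avg_vote  = avg_vote,
  males     = avg_vote + c(0.1, -0.1, 0.2, 0, -0.2, 0.1, 0, -0.1),
  females   = avg_vote + c(-0.2, 0.3, -0.1, 0.2, 0.1, -0.3, 0.2, -0.1),
  age_18_30 = avg_vote + c(0.3, 0.2, -0.3, 0.4, -0.4, 0.1, 0.3, -0.2)
)

# One single-predictor model per group: avg_vote ~ <group>
groups <- c("males", "females", "age_18_30")
r2_by_group <- vapply(groups, function(col) {
  fit <- lm(reformulate(col, response = "avg_vote"), data = ratings_demo)
  summary(fit)$r.squared
}, numeric(1))

# Highest R^2 first: the group that best tracks the overall average vote
print(sort(r2_by_group, decreasing = TRUE))
```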

3D plot for male, female and average votes

newplot (3).png

What kind of movie should we make and what production house should we collaborate with?

This compares between genres.

This produces heat maps of the correlations within each genre, looking only at ROI and at data that can be collected prior to movie release.

Multiple Linear Regression by Genre


We find that ROI is not highly predictable, even only looking at a single genre. However, it is much more predictable than looking at movies overall. Furthermore, some genres are more predictable than others.
For example, using our set of priors, we get an $R^2$ of 59% for Biographical films, making them a "safer bet" for investment.
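
A per-genre regression of this kind can be sketched by splitting the data on genre and fitting the same model within each subset. The toy data, the two predictors, and the resulting $R^2$ values are illustrative assumptions; they do not reproduce the figures reported above.

```r
# Shared toy predictors for eight films per genre (hypothetical values)
budget_musd <- c(10, 20, 30, 40, 50, 60, 70, 80)
avg_vote    <- c(6, 7, 6.5, 8, 7.5, 7, 8.5, 6)

# Biography: ROI follows the predictors closely; Comedy: ROI is erratic
bio <- data.frame(
  genre = "Biography", budget_musd, avg_vote,
  netratio = 0.5 + 0.3 * avg_vote - 0.01 * budget_musd +
    c(0.05, -0.05, 0.02, -0.02, 0.03, -0.03, 0.01, -0.01)
)
com <- data.frame(
  genre = "Comedy", budget_musd, avg_vote,
  netratio = c(3, 1, 4, 1, 5, 9, 2, 6)
)
films <- rbind(bio, com)

# Fit the same pre-release model separately within each genre
r2_by_genre <- sapply(split(films, films$genre), function(d) {
  summary(lm(netratio ~ budget_musd + avg_vote, data = d))$r.squared
})
print(round(r2_by_genre, 2))
```

With real data, genres with few films should be excluded first, since a model with several predictors can fit a handful of points almost perfectly by chance.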

Overall findings of which groups perform best at predicting ratings:

Some basic comparisons between the genres

While Biography is a "safe" genre in that its priors are somewhat predictive of its ROI, we need more information to make an informed decision.

We find that Biographies and Comedies have the best ROI and that Biographies have, by far, the best average ratings.

Which production company? We find that Universal has the second highest ROI and the second highest average ratings, making it a good all-around pick for ticket sales and streaming services.
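
A comparison like this can be sketched by averaging ROI and rating per company and ranking on both. The studio names and values below are placeholders, and the equal-weight sum of the two ranks is an assumption made for the example, not the method used in the notebook.

```r
# Toy film table with placeholder studios (hypothetical values)
films <- data.frame(
  production_company = c("Studio A", "Studio A", "Studio B",
                         "Studio B", "Studio C", "Studio C"),
  netratio = c(2, 4, 1, 2, 5, 3),
  avg_vote = c(8, 8, 8, 7, 5, 6),
  stringsAsFactors = FALSE
)

# Mean ROI and mean rating per production company
by_company <- aggregate(cbind(netratio, avg_vote) ~ production_company,
                        data = films, FUN = mean)

# Rank each measure (1 = best) and combine the ranks with equal weight
by_company$roi_rank      <- rank(-by_company$netratio)
by_company$vote_rank     <- rank(-by_company$avg_vote)
by_company$combined_rank <- by_company$roi_rank + by_company$vote_rank
by_company <- by_company[order(by_company$combined_rank), ]
print(by_company)
```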

Final Report

The objective of the analysis is to find metrics that will streamline the decision-making process for the distributor, Big Time Hollywood Production Company. We started by gathering data. For this purpose we used IMDb, exchange rate data, ticket prices over time, and custom surveys.

Because our dataset includes many older films with directors and actors who are no longer in the business, we decided to focus on the numerical data that would be steady (or that we could normalize) from year to year. The dataset was also incomplete, with many NA values scattered throughout. We cleaned the data by removing unnecessary rows and columns, dropping NA values, and creating new columns for our analysis. Some columns with numeric values had characters in them (like $ and ,), so we converted those to suitable data types.
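
The character-stripping step can be sketched as below. The example strings and the assumption that a currency marker precedes the amount are illustrative; the variable names are hypothetical.

```r
# Toy budget strings with currency markers and thousands separators
raw_budget <- c("$ 1,000,000", "INR 50,000,000", "GBP 2500000", NA)

# Currency marker: the leading run of characters before the first digit/space
currency <- sub("^[[:space:]]*([^0-9 ]+)[[:space:]]*.*$", "\\1", raw_budget)
currency[currency %in% "$"] <- "USD"

# Amount: drop everything except digits and the decimal point, then convert
amount <- as.numeric(gsub("[^0-9.]", "", raw_budget))

budget_clean <- data.frame(currency, amount, stringsAsFactors = FALSE)

# Remove rows where the budget is missing
budget_clean <- budget_clean[!is.na(budget_clean$amount), ]
print(budget_clean)
```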

We created new columns of weighted votes by demographic group to see how the opinions of different groups affect our response variables. We also added new columns called income and netratio (an ROI measure). ROI is our primary response variable because we want to see which factors affect the return on investment the most. Our original dataset had the income and budget of the movies. We created the ROI variable by first converting all data into USD, using exchange rates for each currency in each year, and then dividing total income by the budget in USD, which gives a unitless column.

This ROI measure is unitless and self-corrected for inflation, since income and budget should be near in time. We know that inflation is an issue in our dataset because we checked for a relationship between the year and income columns with our first linear regression model. We clearly see that income goes up as year goes up, suggesting that the dataset is not corrected for inflation. We also addressed this with a ticketinf column that only looks at income normalized for ticket prices during that time period.

Major findings:

  1. In our quest to uncover trends and insights, we found that the US is the country with the highest number of movies in the database, followed by the UK, France, India and Germany.

  2. Surprisingly, the number of movies has been decreasing over the last few years. We did not address this finding in our analysis.

  3. Among the five most represented movie genres, Comedy leads the list, followed by Action, Drama and Crime. Before continuing with the analysis we want to confirm something very important.

  4. We find strong correlations between the rating patterns of all of our demographic groups, suggesting we can save money in focus groups testing.

  5. Strong performance in the US is correlated with strong performance worldwide. Therefore KPIs captured in the US may carry over to the overall release as well.

  6. A correlation is present between the income measures and the budget of the movie, i.e., a high production budget often means a higher income for the movie. However, budget was not correlated with ROI.

  7. Duration does not have significant correlation with any of the measures we are trying to study.

  8. Surprisingly, ROI does not have a strong correlation with any votes metric.

  9. Vote and income measures of the movie account for only about 43% of the variation in ROI.

  10. We improved the fit of our model by removing the outliers in our data, specifically movies with low budget and very high income. Removing these outliers takes our linear regression model from $R^2$ = 0.012 to $R^2$ = 0.45.

  11. The top three production houses in terms of number of movies made are MGM, Warner Bros. and Universal Studios.

  12. The best performing genres in terms of ROI are Comedy, Drama and Action.

  13. Comedy movies have the highest ROI because of their low budget and high income.

Historical data does not give us significant insight into ROI or how to predict it precisely. In recent years the movie business has been widely affected by the introduction of streaming services, and we have tried to account for this rise of digital media. Streaming services are more likely to pick up the movies with the most positive ratings. We have voting data from multiple demographic groups in our dataset, and we want to see which groups' votes best reflect voting trends in general. This can help reduce focus-group expenses, and the results can be used by streaming services when picking up movies.

Findings on voting trends based on different age groups

  1. We find that the all-male and all-female voting groups perform extremely well at predicting the average vote.

  2. For specific genres, some age groups perform very well at predicting the average vote. For example, 18-30 year olds are excellent at predicting the average vote for Comedy and Drama films.

  3. Importantly, we find that the Biography genre has excellent average ratings as well as a decent average ROI, making it a good genre choice for our distributor.

Answers to our initial questions:

1. How does this data relate to the business and how does it justify our objective?

This data has many columns, such as votes, budget, income and demographics, that can be used to understand the mechanics of successful movies.

2. What factors affect the revenues made by previous films?

Based on our analysis so far, budget, good reviews and genre affect revenue the most. However, revenue is very hard to predict.

3. Does the budget of the movie affect the success of the movie?

Yes. Budget is significantly correlated with the revenue a movie generates, but other factors affect success as well.

4. What factors lead to the not so good ratings and good ratings?

Genre and production quality (budget) are the factors that affect the ratings.

5. If a movie does well in one market is it likely to do well in others?

Our analysis shows that the US produces the most movies and that, surprisingly, a movie that does well in the US is more likely to do well worldwide too.

Additional data collection:

This data is pre-pandemic, and we believe that people today are more likely than ever to use streaming services (where voter rating is king). We ran our own survey to find out about people's current movie-watching habits.

The survey questions and all collected responses are shown below:

Screen Shot 2021-09-27 at 11.52.56 PM.png

Screen Shot 2021-09-27 at 11.53.07 PM.png

Based on the responses above we can clearly see the post-pandemic audience's preference for movies to be seen via streaming services rather than theatres. Only 7% of the sample population said they're comfortable in theaters.

We also get some additional insights from the survey as listed below:

  1. People like to watch movies in a foreign language as long as they have captions or are dubbed in the regional language.

  2. The most loved streaming services are Netflix, Amazon Prime and HBO Max.

Now that we have established the importance of streaming services, we need more information on how they are performing in general. What are the predicted future growth rates, and how can we incorporate them into the future business model of Big Time Hollywood Production Company?

Screen Shot 2021-09-28 at 10.15.26 AM.png

Screen Shot 2021-09-28 at 10.15.38 AM.png

We can clearly see the upward trend in streaming-service revenue. More revenue means more customers, and more customers means more devices, so with one click a movie can reach millions of devices, which is far higher than the footfall that can be expected from movie theatres. Below is a chart showing the number of subscribers for each streaming service.

Screen Shot 2021-09-28 at 10.27.31 AM.png

Final Recommendations

Based on the analysis above and considering the possible outcomes, we recommend the following to the business owner. A movie in the Comedy genre will reduce budget and thus increase ROI, while a movie in the Biography genre will increase voter ratings while keeping ROI fairly high. Our distributor should partner with Universal Studios or MGM and hold a screening enriched for the 30-45 age group, with the movie released simultaneously in theatres and on streaming services.

Originality

Previous analyses of this dataset primarily focus on the existing columns, without adding ROI, inflation, etc. However, one analysis we found does examine ROI (though without handling multiple currencies). That analysis focuses on ROI by director and actor popularity, examining metrics like Facebook likes and the fame of the actors pictured in the posters. It is a very different tactic from the one we took, which is based more heavily on pre-release data than on pre-production data. Our data is time-neutral in that it does not consider active directors and actors, but rather looks at measures where we can directly compare historical data to recent data. This increases the number of observations, making us less prone to error.